Please visit this page for a code implementation of the fixed-step-size gradient method and this page for a code implementation of the steepest descent method.
Suppose that we are given a point x(k). To find the next point x(k+1), we start at x(k) and move by an amount −αk∇f(x(k)), where αk is a positive scalar called the step size. This procedure leads to the following iterative algorithm:
$$x^{(k+1)} = x^{(k)} - \alpha_k \nabla f\big(x^{(k)}\big).$$
We refer to this as a gradient descent algorithm (or simply a gradient algorithm).
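As a quick illustration (not part of the original text), the iteration above can be sketched in a few lines of Python; the quadratic objective, starting point, step size, and iteration count are all arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(grad, x0, step_size, num_iters=100):
    """Gradient algorithm x(k+1) = x(k) - alpha * grad f(x(k)), with fixed alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad(x)
    return x

# Illustrative objective f(x) = x1^2 + 2*x2^2, with gradient (2*x1, 4*x2).
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
x_min = gradient_descent(grad, x0=[3.0, -2.0], step_size=0.1, num_iters=200)
print(x_min)  # close to the minimizer (0, 0)
```

Here the step size is held fixed for simplicity; the choice of αk is exactly what distinguishes the variants of the gradient method discussed below.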
The method of steepest descent is a gradient algorithm where the step size αk is chosen to achieve the maximum amount of decrease of the objective function at each individual step. Specifically, αk is chosen to minimize ϕk(α)≜f(x(k)−α∇f(x(k))). In other words,
$$\alpha_k = \arg\min_{\alpha \ge 0} f\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big).$$
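A minimal Python sketch of this rule, using a hand-rolled golden-section search (a helper of my own, not from the text) to approximately minimize ϕk(α) over a bracketing interval; the objective is an arbitrary example:

```python
import numpy as np

def line_search(phi, lo=0.0, hi=1.0, tol=1e-8):
    """Golden-section search for the minimizer of a unimodal phi on [lo, hi]."""
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def steepest_descent(f, grad, x0, num_iters=50):
    """Steepest descent: alpha_k minimizes phi_k(a) = f(x(k) - a * grad f(x(k)))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:   # stationary point reached: stop
            break
        alpha = line_search(lambda a: f(x - a * g))
        x = x - alpha * g
    return x

f = lambda x: x[0] ** 2 + 2.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
x_min = steepest_descent(f, grad, x0=[3.0, -2.0])
```

For quadratics an exact formula for αk is derived below, which avoids the numerical line search altogether.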
Observe that the method of steepest descent moves in orthogonal steps, as stated in the following proposition.
If {x(k)}, k = 0, 1, 2, …, is a steepest descent sequence for a given function f:Rn→R, then for each k the vector x(k+1)−x(k) is orthogonal to the vector x(k+2)−x(k+1).
To see this, note that by the chain rule, ϕk′(α) = −∇f(x(k) − α∇f(x(k)))⊤∇f(x(k)). Because αk minimizes ϕk, the FONC gives ϕk′(αk) = −∇f(x(k+1))⊤∇f(x(k)) = 0. Since x(k+1) − x(k) = −αk∇f(x(k)) and x(k+2) − x(k+1) = −α_{k+1}∇f(x(k+1)), the inner product of the two difference vectors is a nonnegative multiple of ∇f(x(k+1))⊤∇f(x(k)) = 0, which completes the proof.
The proposition above implies that ∇f(x(k)) is parallel to the tangent plane to the level set {f(x) = f(x(k+1))} at x(k+1). Note that as each new point is generated by the steepest descent algorithm, the corresponding value of f decreases, as stated below.
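The orthogonality of successive steps is easy to observe numerically. The sketch below applies steepest descent to a quadratic f(x) = (1/2)x⊤Qx − b⊤x, using the exact step size αk = (g(k)⊤g(k))/(g(k)⊤Qg(k)) derived later in this section; the particular Q, b, and starting point are arbitrary:

```python
import numpy as np

# Steepest descent on f(x) = (1/2) x^T Q x - b^T x with the exact quadratic
# step alpha_k = (g^T g)/(g^T Q g); successive steps come out orthogonal.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.array([5.0, -3.0])

points = [x.copy()]
for _ in range(3):
    g = Q @ x - b
    alpha = (g @ g) / (g @ Q @ g)
    x = x - alpha * g
    points.append(x.copy())

d1 = points[1] - points[0]
d2 = points[2] - points[1]
print(d1 @ d2)  # ~0 up to roundoff: successive steps are orthogonal
```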
Let us now see what the method of steepest descent does with a quadratic function of the form
$$f(x) = \tfrac{1}{2} x^\top Q x - b^\top x,$$
where Q∈Rn×n is a symmetric positive definite matrix, b∈Rn, and x∈Rn. The unique minimizer of f can be found by setting the gradient of f to zero, where
∇f(x)=Qx−b,
because D(x⊤Qx) = x⊤(Q + Q⊤) = 2x⊤Q and D(b⊤x) = b⊤. There is no loss of generality in assuming Q to be a symmetric matrix. For if we are given a quadratic form x⊤Ax with A ≠ A⊤, then because the transpose of a scalar equals itself, we obtain
(x⊤Ax)⊤=x⊤A⊤x=x⊤Ax.
Hence,
$$x^\top A x = \tfrac{1}{2} x^\top A x + \tfrac{1}{2} x^\top A^\top x = \tfrac{1}{2} x^\top (A + A^\top) x \triangleq \tfrac{1}{2} x^\top Q x.$$
Note that
$$Q^\top = (A + A^\top)^\top = A + A^\top = Q.$$
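A small numerical check of this symmetrization (the matrix and vector are random examples): for any square A, the quadratic form x⊤Ax coincides with (1/2)x⊤Qx where Q = A + A⊤ is symmetric.

```python
import numpy as np

# x^T A x equals (1/2) x^T Q x with Q = A + A^T, and Q is symmetric.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # generally nonsymmetric
Q = A + A.T
x = rng.standard_normal(4)

lhs = x @ A @ x
rhs = 0.5 * x @ Q @ x
print(abs(lhs - rhs))             # ~0 up to roundoff
print(np.allclose(Q, Q.T))       # True: Q is symmetric
```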
The Hessian of f is F(x) = Q = Q⊤ > 0. To simplify the notation we write g(k) = ∇f(x(k)) = Qx(k) − b. Then, the steepest descent algorithm for the quadratic function can be represented as
$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}, \qquad \alpha_k = \arg\min_{\alpha \ge 0} f\big(x^{(k)} - \alpha g^{(k)}\big).$$
In the quadratic case, we can find an explicit formula for αk. We proceed as follows. Assume that g(k) ≠ 0, for if g(k) = 0, then x(k) = x∗ and the algorithm stops. Because αk ≥ 0 is a minimizer of ϕk(α) = f(x(k) − αg(k)), we apply the FONC to ϕk(α) to obtain
ϕk′(α)=(x(k)−αg(k))⊤Q(−g(k))−b⊤(−g(k)).
Therefore, ϕk′(α)=0 if αg(k)⊤Qg(k)=(x(k)⊤Q−b⊤)g(k). But
x(k)⊤Q−b⊤=g(k)⊤.
Hence,
$$\alpha_k = \frac{g^{(k)\top} g^{(k)}}{g^{(k)\top} Q g^{(k)}}.$$
In summary, the method of steepest descent for the quadratic takes the form
$$x^{(k+1)} = x^{(k)} - \frac{g^{(k)\top} g^{(k)}}{g^{(k)\top} Q g^{(k)}}\, g^{(k)}, \qquad g^{(k)} = Q x^{(k)} - b.$$
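A sketch of this quadratic steepest descent update x(k+1) = x(k) − (g(k)⊤g(k))/(g(k)⊤Qg(k)) g(k) in Python; Q, b, and the starting point are arbitrary choices, and the final iterate is compared against x∗ = Q−1b:

```python
import numpy as np

# Steepest descent for f(x) = (1/2) x^T Q x - b^T x with the closed-form
# step size; the iterates converge to x* = Q^{-1} b.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x = np.zeros(2)

for _ in range(100):
    g = Q @ x - b                      # gradient at the current iterate
    if np.linalg.norm(g) < 1e-12:      # stationary point reached: stop
        break
    x = x - (g @ g) / (g @ Q @ g) * g  # exact line-search step

x_star = np.linalg.solve(Q, b)
print(np.linalg.norm(x - x_star))      # ~0
```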
We can investigate important convergence characteristics of a gradient method by applying the method to quadratic problems. The convergence analysis is more convenient if, instead of working with f, we deal with
$$V(x) = f(x) + \tfrac{1}{2} x^{*\top} Q x^* = \tfrac{1}{2} (x - x^*)^\top Q (x - x^*),$$
where Q = Q⊤ > 0. The solution point x∗ is obtained by solving Qx = b; that is, x∗ = Q−1b. The function V differs from f only by the constant (1/2)x∗⊤Qx∗. We begin our analysis with the following useful lemma that applies to a general gradient algorithm.
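A quick check (with an arbitrary Q and b) that V and f differ only by the stated constant, so both have the same minimizer:

```python
import numpy as np

# V(x) = f(x) + (1/2) x*^T Q x* should equal (1/2)(x - x*)^T Q (x - x*).
Q = np.array([[2.0, 0.0], [0.0, 5.0]])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(Q, b)

f = lambda x: 0.5 * x @ Q @ x - b @ x
V = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)

rng = np.random.default_rng(1)
x = rng.standard_normal(2)
print(V(x) - (f(x) + 0.5 * x_star @ Q @ x_star))  # ~0
```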
The proof is by direct computation. Note that if g(k) = 0, then the desired result holds trivially. In the remainder of the proof, assume that g(k) ≠ 0. We first evaluate the expression
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})}.$$
To facilitate computations, let y(k) = x(k) − x∗. Then, V(x(k)) = (1/2)y(k)⊤Qy(k). Hence,
Note that γk≤1 for all k, because γk=1−V(x(k+1))/V(x(k)) and V is a nonnegative function. If γk=1 for some k, then V(x(k+1))=0, which is equivalent to x(k+1)=x∗. In this case we also have that for all i≥k+1, x(i)=x∗ and γi=1. It turns out that γk=1 if and only if either gˉ(k)=0 or g(k) is an eigenvector of Q (see Lemma 3).
We are now ready to state and prove our key convergence theorem for gradient methods. The theorem gives a necessary and sufficient condition for the sequence {x(k)} generated by a gradient method to converge to x∗; that is, x(k) → x∗ or lim_{k→∞} x(k) = x∗.
Let {x(k)} be the sequence resulting from a gradient algorithm x(k+1) = x(k) − αk g(k). Let γk be as defined in Lemma 1, and suppose that γk > 0 for all k. Then, {x(k)} converges to x∗ for any initial condition x(0) if and only if
$$\sum_{k=0}^{\infty} \gamma_k = \infty.$$
From Lemma 1 we have V(x(k+1))=(1−γk)V(x(k)), from which we obtain
$$V(x^{(k)}) = \left( \prod_{i=0}^{k-1} (1 - \gamma_i) \right) V(x^{(0)}).$$
Assume that γk < 1 for all k, for otherwise the result holds trivially. Note that x(k) → x∗ if and only if V(x(k)) → 0. By the equation above we see that this occurs if and only if ∏_{i=0}^∞ (1 − γi) = 0, which, in turn, holds if and only if ∑_{i=0}^∞ −log(1 − γi) = ∞ (we get this simply by taking logs). Note that by Lemma 1, 1 − γi ≥ 0 and log(1 − γi) is well-defined [log(0) is taken to be −∞]. Therefore, it remains to show that ∑_{i=0}^∞ −log(1 − γi) = ∞ if and only if
$$\sum_{i=0}^{\infty} \gamma_i = \infty.$$
We first show that ∑_{i=0}^∞ γi = ∞ implies that ∑_{i=0}^∞ −log(1 − γi) = ∞. For this, first observe that for any x ∈ R, x > 0, we have log(x) ≤ x − 1 [this is easy to see simply by plotting log(x) and x − 1 versus x]. Therefore, log(1 − γi) ≤ (1 − γi) − 1 = −γi, and hence −log(1 − γi) ≥ γi. Thus, if ∑_{i=0}^∞ γi = ∞, then clearly ∑_{i=0}^∞ −log(1 − γi) = ∞.
Finally, we show that ∑_{i=0}^∞ −log(1 − γi) = ∞ implies that ∑_{i=0}^∞ γi = ∞. We proceed by contraposition. Suppose that ∑_{i=0}^∞ γi < ∞. Then, it must be that γi → 0. Now observe that for x ∈ R, x ≤ 1 and x sufficiently close to 1, we have log(x) ≥ 2(x − 1) [as before, this is easy to see simply by plotting log(x) and 2(x − 1) versus x]. Therefore, for sufficiently large i, log(1 − γi) ≥ 2((1 − γi) − 1) = −2γi, which implies that −log(1 − γi) ≤ 2γi. Hence, ∑_{i=0}^∞ γi < ∞ implies that ∑_{i=0}^∞ −log(1 − γi) < ∞.
This completes the proof.
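The two logarithm bounds used in the proof can be spot-checked numerically (the sampling grids below are arbitrary):

```python
import numpy as np

# log(x) <= x - 1 for all x > 0, and log(x) >= 2(x - 1) for x <= 1 near 1.
xs = np.linspace(0.01, 5.0, 1000)
assert np.all(np.log(xs) <= xs - 1 + 1e-12)

xs_near_one = np.linspace(0.6, 1.0, 200)
assert np.all(np.log(xs_near_one) >= 2 * (xs_near_one - 1) - 1e-12)
print("both inequalities hold on the sampled grids")
```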
If g(k) = 0 for some k, then x(k) = x∗ and the result holds. So assume that g(k) ≠ 0 for all k. Recall that for the steepest descent algorithm,
$$\alpha_k = \frac{g^{(k)\top} g^{(k)}}{g^{(k)\top} Q g^{(k)}}.$$
Substituting this expression for αk in the formula for γk yields
$$\gamma_k = \frac{\big(g^{(k)\top} g^{(k)}\big)^2}{\big(g^{(k)\top} Q g^{(k)}\big)\big(g^{(k)\top} Q^{-1} g^{(k)}\big)}.$$
Note that in this case γk > 0 for all k. Furthermore, by Lemma 8.2, we have γk ≥ λmin(Q)/λmax(Q) > 0. Therefore, we have ∑_{k=0}^∞ γk = ∞, and hence by Theorem 8.1 we conclude that x(k) → x∗.
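The lower bound γk ≥ λmin(Q)/λmax(Q) can be checked by sampling random directions g (the matrix below is an arbitrary symmetric positive definite example):

```python
import numpy as np

# gamma = (g^T g)^2 / ((g^T Q g)(g^T Q^{-1} g)) is bounded below by
# lambda_min(Q) / lambda_max(Q), uniformly over directions g.
rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
Q = M @ M.T + 3.0 * np.eye(3)        # symmetric positive definite test matrix
Q_inv = np.linalg.inv(Q)
eigs = np.linalg.eigvalsh(Q)         # ascending eigenvalues
bound = eigs[0] / eigs[-1]           # lambda_min / lambda_max

gammas = []
for _ in range(1000):
    g = rng.standard_normal(3)
    gammas.append((g @ g) ** 2 / ((g @ Q @ g) * (g @ Q_inv @ g)))

print(min(gammas) >= bound)  # True
```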
Consider now a gradient method with fixed step size; that is, αk=α∈R for all k. The resulting algorithm is of the form
x(k+1)=x(k)−αg(k).
We refer to the algorithm above as a fixed-step-size gradient algorithm. The algorithm is of practical interest because of its simplicity. In particular, the algorithm does not require a line search at each step to determine αk, because the same step size α is used at each step. Clearly, the convergence of the algorithm depends on the choice of α, and we would not expect the algorithm to work for arbitrary α. The following theorem gives a necessary and sufficient condition on α for convergence of the algorithm: for any initial condition x(0), the algorithm converges to x∗ if and only if 0 < α < 2/λmax(Q).
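A sketch of the fixed-step-size algorithm on a quadratic, with α chosen inside the convergence range 0 < α < 2/λmax(Q); Q, b, the safety factor 0.9, and the starting point are arbitrary:

```python
import numpy as np

# Fixed-step-size gradient algorithm on a quadratic: converges for any
# step size alpha with 0 < alpha < 2 / lambda_max(Q).
Q = np.array([[4.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
lam_max = np.linalg.eigvalsh(Q)[-1]
alpha = 0.9 * (2.0 / lam_max)       # safely inside the convergence range

x = np.array([10.0, -10.0])
for _ in range(500):
    x = x - alpha * (Q @ x - b)     # same alpha at every step: no line search

x_star = np.linalg.solve(Q, b)
print(np.linalg.norm(x - x_star))   # ~0
```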
Therefore, substituting the above into the formula for γk, we get
$$\gamma_k \ge \alpha \big(\lambda_{\min}(Q)\big)^2 \left( \frac{2}{\lambda_{\max}(Q)} - \alpha \right) > 0.$$
Therefore, γk>0 for all k, and ∑k=0∞γk=∞. Hence, by Theorem 8.1 we conclude that x(k)→x∗.
⇒ : We use contraposition. Suppose that either α≤0 or α≥2/λmax(Q).
Let x(0) be chosen such that x(0) − x∗ is an eigenvector of Q corresponding to the eigenvalue λmax(Q). Because g(k) = Q(x(k) − x∗), we obtain x(k+1) − x∗ = (x(k) − x∗) − αQ(x(k) − x∗) = (1 − αλmax(Q))(x(k) − x∗), and hence ∥x(k) − x∗∥ = |1 − αλmax(Q)|^k ∥x(0) − x∗∥. If α ≤ 0 or α ≥ 2/λmax(Q), then |1 − αλmax(Q)| ≥ 1, and so x(k) cannot converge to x∗ when x(0) ≠ x∗.
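This divergence mechanism can be reproduced numerically: start with x(0) − x∗ an eigenvector for λmax(Q) and pick α ≥ 2/λmax(Q) (the particular numbers below are arbitrary):

```python
import numpy as np

# With x(0) - x* an eigenvector for lambda_max and alpha >= 2/lambda_max(Q),
# the error is multiplied by |1 - alpha*lambda_max| >= 1 at every step.
Q = np.diag([4.0, 1.0])
b = np.zeros(2)                      # so x* = 0
lam_max = 4.0
alpha = 2.0 / lam_max + 0.1          # outside the convergence range

x = np.array([1.0, 0.0])             # eigenvector of Q for lambda_max
errors = []
for _ in range(20):
    x = x - alpha * (Q @ x - b)
    errors.append(np.linalg.norm(x))

print(errors[-1] > errors[0])  # True: the error grows instead of shrinking
```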
We now turn our attention to the issue of convergence rates of gradient algorithms. In particular, we focus on the steepest descent algorithm. We first present the following theorem.
In the proof of Theorem 2, we showed that γk≥λmin(Q)/λmax(Q). Therefore,
$$\frac{V(x^{(k)}) - V(x^{(k+1)})}{V(x^{(k)})} = \gamma_k \ge \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)},$$
and the result follows.
Theorem 4 is relevant to our consideration of the convergence rate of the steepest descent algorithm as follows. Let
$$r = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)} = \|Q\| \, \|Q^{-1}\|,$$
called the condition number of Q. Then, it follows from Theorem 8.4 that
$$V(x^{(k+1)}) \le \left(1 - \frac{1}{r}\right) V(x^{(k)}).$$
The term (1 − 1/r) plays an important role in the convergence of {V(x(k))} to 0 (and hence of {x(k)} to x∗). We refer to (1 − 1/r) as the convergence ratio. Specifically, the smaller the value of (1 − 1/r), the smaller V(x(k+1)) will be relative to V(x(k)), and hence the "faster" V(x(k)) converges to 0, as indicated by the inequality above. The convergence ratio (1 − 1/r) decreases as r decreases. If r = 1, then λmax(Q) = λmin(Q), corresponding to circular contours of f (see Figure 8.6). In this case the algorithm converges in a single step to the minimizer. As r increases, the speed of convergence of {V(x(k))} (and hence of {x(k)}) decreases. The increase in r reflects the fact that the contours of f are more eccentric (see, e.g., Figure 8.7). We refer the reader to [88, pp. 238, 239] for an alternative approach to the analysis above.
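The per-step bound V(x(k+1)) ≤ (1 − 1/r)V(x(k)) can be observed directly for steepest descent on a quadratic; the diagonal Q below (condition number r = 10) and the starting point are arbitrary choices:

```python
import numpy as np

# Steepest descent on a quadratic with condition number r = 10:
# each step should satisfy V(x_next) <= (1 - 1/r) V(x).
Q = np.diag([10.0, 1.0])             # lambda_max / lambda_min = 10
b = np.zeros(2)                      # so x* = 0 and V(x) = (1/2) x^T Q x
V = lambda x: 0.5 * x @ Q @ x

r = 10.0
x = np.array([1.0, 1.0])
ratios = []
for _ in range(10):
    g = Q @ x - b
    alpha = (g @ g) / (g @ Q @ g)    # exact line-search step
    x_next = x - alpha * g
    ratios.append(V(x_next) / V(x))
    x = x_next

print(max(ratios) <= 1 - 1 / r)  # True: every observed ratio obeys the bound
```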
To investigate the convergence properties of {x(k)} further, we need the following definition.
Given a sequence {x(k)} that converges to x∗, that is, lim_{k→∞} ∥x(k) − x∗∥ = 0, we say that the order of convergence is p, where p ∈ R, if
$$0 < \lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} < \infty.$$
If for all p>0
$$\lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} = 0,$$
then we say that the order of convergence is ∞.
Note that in the definition above, 0/0 should be understood to be 0.
The order of convergence of a sequence is a measure of its rate of convergence; the higher the order, the faster the rate of convergence. The order of convergence is sometimes also called the rate of convergence (see, e.g., [96]). If p = 1 (first-order convergence) and lim_{k→∞} ∥x(k+1) − x∗∥/∥x(k) − x∗∥ = 1, we say that the convergence is sublinear. If p = 1 and lim_{k→∞} ∥x(k+1) − x∗∥/∥x(k) − x∗∥ < 1, we say that the convergence is linear. If p > 1, we say that the convergence is superlinear. If p = 2 (second-order convergence), we say that the convergence is quadratic.
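As an illustration of the definition, the order p can be estimated empirically by fitting the slope of log∥e(k+1)∥ against log∥e(k)∥ (a common heuristic, not from the text); the two synthetic error sequences below converge linearly and quadratically by construction:

```python
import numpy as np

# Empirical order of convergence: fit |e_{k+1}| ~ C |e_k|^p, so p is the
# slope of log|e_{k+1}| versus log|e_k|.
def estimate_order(errors):
    logs = np.log(np.asarray(errors))
    return np.polyfit(logs[:-1], logs[1:], 1)[0]

linear_errors = [0.5 ** k for k in range(1, 12)]        # e_{k+1} = e_k / 2
quadratic_errors = [0.5 ** (2 ** k) for k in range(1, 6)]  # e_{k+1} = e_k^2

print(round(estimate_order(linear_errors)))     # 1
print(round(estimate_order(quadratic_errors)))  # 2
```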